====================================================================================================================

PART ONE

====================================================================================================================

DOMAIN: Automobile

CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes

DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon

Attribute Information:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

1. Import and warehouse data:

• Import all the given datasets and explore shape and size.

• Merge all datasets onto one and explore final shape and size.

• Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use.

• Import the data from above steps into python.

cardata_csv has 398 rows and 1 column

cardata_json has 398 rows and 8 columns

Now we can inspect the data with head().
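The import step can be sketched as below. The file contents here are toy stand-ins (the actual notebook reads `cardata.csv` and `cardata.json` from disk; those names are assumptions):

```python
import io
import pandas as pd

# Toy stand-ins for the two source files; in the notebook these would be
# pd.read_csv("cardata.csv") and pd.read_json("cardata.json").
csv_text = "car name\nchevrolet chevelle malibu\nbuick skylark 320\n"
json_text = '[{"mpg": 18.0, "cylinders": 8}, {"mpg": 15.0, "cylinders": 8}]'

cardata_csv = pd.read_csv(io.StringIO(csv_text))
cardata_json = pd.read_json(io.StringIO(json_text))

# Explore shape and size, then peek at the first rows
print(cardata_csv.shape)   # (rows, columns)
print(cardata_json.shape)
print(cardata_csv.head())
```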

• Merge all datasets onto one and explore final shape and size.

Since both files have the same number of rows, we can merge them with the join function.

After merging we have 398 rows and 9 columns; the join added the columns from the second file.
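The merge step can be sketched with toy frames (the real frames are cardata_csv and cardata_json, both 398 rows):

```python
import pandas as pd

# Minimal stand-ins for the two frames being merged
names = pd.DataFrame({"car name": ["chevelle", "skylark"]})
stats = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8]})

# Row counts match and both use the default RangeIndex, so an
# index-aligned join adds the columns side by side.
cardata = names.join(stats)
print(cardata.shape)  # rows unchanged, columns added
```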

• Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use.

• Import the data from above steps into python.
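The export/re-import round trip can be sketched as follows; the output file names and the toy frame are illustrative assumptions:

```python
import os
import tempfile
import pandas as pd

# Toy merged frame (the real one has 398 rows and 9 columns)
cardata = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8]})

outdir = tempfile.mkdtemp()
cardata.to_csv(os.path.join(outdir, "cardata_final.csv"), index=False)
cardata.to_json(os.path.join(outdir, "cardata_final.json"), orient="records")
# .xlsx additionally needs an engine such as openpyxl installed:
# cardata.to_excel(os.path.join(outdir, "cardata_final.xlsx"), index=False)

# Re-import to confirm a clean round trip
reloaded = pd.read_csv(os.path.join(outdir, "cardata_final.csv"))
print(reloaded.shape)
```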

2. Data cleansing:

• Missing/incorrect value treatment

• Drop attribute/s if required using relevant functional knowledge

• Perform another kind of corrections/treatment on the data.

Let's look at the distribution of hp.

Year is more useful if we transform it into the age of the vehicle. Since the year of data collection is not given, we take the maximum model year as the reference year.

Origin, as noted earlier, encodes the production region, so it should be expanded into dummy variables.
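These two transformations can be sketched as below; the column names `yr` and `origin` and the toy values are assumptions standing in for the real frame:

```python
import pandas as pd

df = pd.DataFrame({"yr": [70, 76, 82], "origin": [1, 3, 2]})

# Age of vehicle: the collection year is unknown, so treat the maximum
# model year in the data as the reference year.
df["age"] = df["yr"].max() - df["yr"]

# Origin is a production-region code, not an ordinal quantity, so expand
# it into dummy variables.
df = pd.get_dummies(df, columns=["origin"], prefix="origin")
print(df.columns.tolist())
```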

3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Descriptive analysis

MPG: the mean and the median are almost the same, which suggests the data is close to normally distributed, with very little skewness.

Applying the same mean-median comparison, columns like cyl, disp and hp show skewness, while wt, acc and yr show very little skewness.

Univariate Analysis

Observations:

  1. acc and mpg are nearly normal.
  2. cyl and disp show 3 clusters, suggesting a mixture of 3 Gaussians.
  3. wt shows 2 clusters, suggesting 2 Gaussians.
  4. From the bivariate plots, mpg shows a negative linear relationship with wt, hp and disp; the same is seen in the correlation matrix.
  5. The age variable shows two peaks.
  6. cyl also shows negative correlation across its levels.
  7. As the number of cylinders increases, mpg goes down.
  8. As displacement increases, mpg goes down, as expected in a real-life scenario.
  9. As hp increases, mpg goes down, as expected in a real-life scenario.
  10. acc vs mpg forms a cloud-like scatter, which indicates a weak relationship.
  11. There is a likelihood of 3 clusters.

There is a high negative correlation between mpg and the variables cyl, disp, hp and wt.

The following features are highly correlated with each other: cyl, disp, hp, wt.
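The correlation check can be sketched as below on a toy frame (the real analysis uses the full cleaned DataFrame with mpg, cyl, disp, hp, wt, acc):

```python
import pandas as pd

# Toy frame standing in for the car data
df = pd.DataFrame({
    "mpg": [18, 15, 36, 31, 27],
    "wt":  [3504, 3693, 1980, 1925, 2190],
    "hp":  [130, 165, 58, 65, 88],
})

corr = df.corr()
# Flag strongly correlated pairs, similar to the thresholds in the report
strong = (corr.abs() >= 0.8) & (corr.abs() < 1.0)
print(corr.round(2))
print(strong)
```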


Removing Outliers

Let's check for outliers.

We use a logarithmic transform on hp, mpg and acc to reduce the effect of outliers.

The other continuous variables should be checked for outliers and normalized using the z-score.
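A minimal sketch of both treatments, on toy values standing in for the real columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hp": [46, 88, 130, 165, 230],
                   "wt": [1835, 2190, 3504, 3693, 4732]})

# Log transform compresses the long right tail of hp (the notebook
# applies the same idea to mpg and acc)
df["hp_log"] = np.log(df["hp"])

# z-score the remaining continuous columns; |z| > 3 flags outliers
z = (df["wt"] - df["wt"].mean()) / df["wt"].std()
print((z.abs() > 3).sum(), "outliers in wt")
```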

4. Machine learning:

• Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data.

• Share your insights about the difference in using these two methods.

K-Means

Now we will use K-Means clustering to group the data based on their attributes. First, we need to determine the optimal number of groups; for that we use the elbow (knee) plot to see where the bend occurs.

We can see that after k = 4 there is only a minor change in error; hence we can consider 4 clusters.

Here we can see that the silhouette score is around 0.4176.

Let's compare the silhouette score for different values of n.

WSS keeps reducing as K increases.
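The elbow and silhouette comparison can be sketched as below. The data here is synthetic (three well-separated blobs) standing in for the scaled car features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic 2-D data with three obvious groups
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 3, 6)])

# Elbow: WSS (inertia) keeps falling as k grows; look for the knee.
# The silhouette score gives a second opinion on cluster quality.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```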

For the four-cluster solution:

Cluster 0: mean mpg is lower than clusters 1 and 2, with an average of 6.17 cylinders

Cluster 1: these are the vehicles with the highest mean mpg

Cluster 2: these are on the higher end of mean mpg, but lower than cluster 1

Cluster 3: these are the vehicles with the lowest mean mpg

For the three-cluster solution:

Cluster 0: MEDIUM RANGE MPG CARS (medium mpg, medium wt, medium acceleration): mean mpg is average, in between c1 and c2

Cluster 1: HIGH RANGE MPG CARS (high mpg, low wt, high acceleration): these are the vehicles with the highest mean mpg, lower weight and high acceleration

Cluster 2: SMALL RANGE MPG CARS (low mpg, high wt, low acceleration): these are the vehicles with the lowest mean mpg, but a higher mean weight

Hierarchical clustering

Cophenetic score for (n =3) =0.77894

Cophenetic score for (n =4) =0.7631

Since n = 3 has a better cophenetic coefficient than n = 4, we can go with n = 3.
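The hierarchical clustering step can be sketched as below with SciPy, again on synthetic blobs standing in for the scaled features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2)) for loc in (0, 3, 6)])

# Ward linkage on the (scaled) features; the cophenetic coefficient
# measures how faithfully the dendrogram preserves pairwise distances.
Z = linkage(X, method="ward")
coph_coeff, _ = cophenet(Z, pdist(X))
print("cophenetic coefficient:", round(coph_coeff, 4))

# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("clusters found:", len(set(labels)))
```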

Cluster 1: HIGH RANGE MPG CARS (high mpg, low wt, high acceleration): these are the vehicles with the highest mean mpg, lower weight and high acceleration

Cluster 2: SMALL RANGE MPG CARS (low mpg, high wt, low acceleration): these are the vehicles with the lowest mean mpg, but a higher mean weight

Cluster 3: MEDIUM RANGE MPG CARS (medium mpg, medium wt, medium acceleration): mean mpg is average, in between c1 and c2

5. Answer below questions based on outcomes of using ML based methods.

• Mention how many optimal clusters are present in the data and what could be the possible reason behind it.

• Use linear regression model on different clusters separately and print the coefficients of the models individually

• How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

ANSWER

• Mention how many optimal clusters are present in the data and what could be the possible reason behind it.

ANSWER

K-Means:

The elbow plot confirms our visual analysis that there are likely 3 or 4 good clusters.

Here we have considered both 3 and 4 clusters:

#

FOUR clusters:

RESULT :

Cluster 0: mean mpg is lower than clusters 1 and 2, with an average of 6.17 cylinders

Cluster 1: these are the vehicles with the highest mean mpg

Cluster 2: these are on the higher end of mean mpg, but lower than cluster 1

Cluster 3: these are the vehicles with the lowest mean mpg

SCORES ::

FOR n= 4 ::
Inertia (error)= 818.4570329845308
Silhouette_score = 0.32626749496530005

#

THREE clusters:

RESULT :

Cluster 0: MEDIUM RANGE MPG CARS (medium mpg, medium wt, medium acceleration): mean mpg is average, in between c1 and c2

Cluster 1: HIGH RANGE MPG CARS (high mpg, low wt, high acceleration): these are the vehicles with the highest mean mpg, lower weight and high acceleration

Cluster 2: SMALL RANGE MPG CARS (low mpg, high wt, low acceleration): these are the vehicles with the lowest mean mpg, but a higher mean weight

SCORES ::

FOR n= 3 ::
Inertia (error)= 1014.5017454129314
Silhouette_score = 0.3387598488728533

#

Here we can see that either 3 or 4 clusters is reasonable to consider: inertia (WSS) is better for n = 4, but the silhouette score is slightly better for n = 3. Both n = 3 and n = 4 have almost similar silhouette scores.

Hierarchical clustering

The cophenetic index is a measure of the correlation between the distances of points in feature space and their distances on the dendrogram.

The closer it is to 1, the better the clustering.

So let's check the cophenetic index for n = 4 and n = 3 ::

Cophenetic score for (n =4) =0.7631

Cophenetic score for (n =3) =0.77894

Since n = 3 has the better cophenetic coefficient, we can consider n = 3 for hierarchical clustering.

#

RESULTS FOR n=3 clustering :

Cluster 1: HIGH RANGE MPG CARS (high mpg, low wt, high acceleration): these are the vehicles with the highest mean mpg, lower weight and high acceleration

Cluster 2: SMALL RANGE MPG CARS (low mpg, high wt, low acceleration): these are the vehicles with the lowest mean mpg, but a higher mean weight

Cluster 3: MEDIUM RANGE MPG CARS (medium mpg, medium wt, medium acceleration): mean mpg is average, in between c1 and c2

• Use linear regression model on different clusters separately and print the coefficients of the models individually

CUMULATIVE LINEAR REGRESSION RESULTS (WITHOUT CLUSTERING):


LR ::

LR model coefficients : [-0.08035158 0.1527018 -0.36583637 -0.51641442 -0.11818455 -0.34181765 0.23006978 0.17320095]

Train accuracy 0.885111127269854

Test accuracy 0.8915530561721962


Lasso ::

lasso model coefficients :
[-0. -0. -0.16974501 -0.54838154 0. -0.25161161 0. 0. ]

Train accuracy 0.8568484665363305

Test accuracy 0.8797222061844738


Ridge ::

Ridge model coefficients: [-0.07687941 0.13920723 -0.36653269 -0.50845284 -0.11985127 -0.34063802 0.22395737 0.16872845]

Train accuracy 0.8851022297292884

Test accuracy 0.891660998648879

===============================

NEXT we need to create a copy of the unscaled data, group it by cluster and then scale each group.

Grouping already-scaled data would leave the newly grouped subsets un-normalised.
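The per-cluster regression idea can be sketched as below. The data, column names and cluster labels are synthetic stand-ins for the clustered car frame:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in: unscaled features plus a cluster label column
df = pd.DataFrame({
    "wt": rng.uniform(1800, 4700, 90),
    "hp": rng.uniform(45, 230, 90),
    "cluster": np.repeat([0, 1, 2], 30),
})
df["mpg"] = 45 - 0.006 * df["wt"] - 0.05 * df["hp"] + rng.normal(0, 1, 90)

# Scale inside each cluster (scaling first and then grouping would leave
# each subgroup un-normalised), then fit a separate model per cluster.
for c, grp in df.groupby("cluster"):
    X = StandardScaler().fit_transform(grp[["wt", "hp"]])
    model = LinearRegression().fit(X, grp["mpg"])
    print(f"cluster {c}: coefficients {model.coef_.round(3)}")
```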

• How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

ANSWER:

As we can see, after clustering many of the features' coefficients become 0, which indicates they no longer contribute much to predicting mpg.

After comparing the accuracy of the different models, we can see that different models fit different clusters differently. For example:

    CLUSTER 01 : Ridge gives the best results.
    CLUSTER 02 : Linear regression and Ridge give better results, while Lasso fails.
    CLUSTER 03 : Lasso gives better results, while linear regression and Ridge fail here.

After clustering, fewer attributes contribute most of the predictive power for mpg within a given cluster. For example:

    CLUSTER 01 : hp and acc contribute the most to the result, as per the coefficients
    CLUSTER 02 : disp, wt and age contribute the most to the result, as per the coefficients
    CLUSTER 03 : disp, wt and age contribute the most to the result, as per the coefficients

Clustering helps reduce dimensionality here, since a few features matter more within a given cluster than the others, and it leads to faster execution.

OBSERVATIONS:

  1. The mpg column for the different brand names is suspect: values were found that are much larger than the factory values for those cars. The definition of mpg may also need to be revisited.

  2. The weight of the cars is also suspect, as values differed from the specifications for those models. There are different types of weights; was the data collected consistently?

  3. The hp column also had values different from the factory specifications. There are different types of hp values; was a standard definition followed?

6. Improvisation:

• Detailed suggestions or improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to enable better data analysis in future.

Each cluster we finally get contains very little data; when we apply a model after clustering, we are left with very few rows per cluster.

The data is imbalanced.

Many of the features have high correlation between them, which is not good.

There are missing data points.

For those instances where the declared mpg is greater than the factory mpg, replace it with the factory mpg; similarly for other columns. When this was done, the standard deviation of the mpg column fell by 50%.

====================================================================================================================

END PART ONE

====================================================================================================================

====================================================================================================================

PART TWO

====================================================================================================================

DOMAIN: Manufacturing

CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.

Attribute Information:

  1. A, B, C, D: specific chemical composition measure of the wine

  2. Quality: quality of wine [ Low and High ]

PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.

Steps and tasks:

  1. Design a synthetic data generation model which can impute values [Attribute: Quality] wherever the company has missed recording the data.

Solutioning Steps:

Read the excel into dataframe

Check shape of dataframe

There are 2 types of quality: Quality A and Quality B.

Let's drop the rows with NaN quality, since we have no quality label for them during processing.

Converting the quality labels to numeric (0 -> Quality A, 1 -> Quality B).

Extracting the features

Scaling Data

Clustering with K-Means and finding the optimal number of clusters with the elbow method.

Since the data contains 2 natural groups (Quality A and B) and the elbow is most pronounced at 2 clusters, we can take the number of clusters as 2.

Here we can see that the accuracy is 1, which is good.

Let's predict the values for the NaN rows:

  1. Scale the numerical data of the initial df2
  2. Predict the labels with the model
  3. Map back 0 -> "Quality A" and 1 -> "Quality B"

Since the clustering model predicts the known labels very well (accuracy = 1), we can say the clusters are good for imputing the NaN values.
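The whole imputation pipeline can be sketched as below. The chemical data is synthetic (two well-separated profiles) and the column names A/B are stand-ins for the real attributes:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in for the wine frame: two chemical profiles, some Quality missing
a = rng.normal(0, 0.4, size=(30, 2))
b = rng.normal(4, 0.4, size=(30, 2))
df = pd.DataFrame(np.vstack([a, b]), columns=["A", "B"])
df["Quality"] = ["Quality A"] * 30 + ["Quality B"] * 30
df.loc[[5, 40], "Quality"] = np.nan

# Fit scaler and K-Means (k=2) on the labelled rows only
known = df.dropna(subset=["Quality"])
scaler = StandardScaler().fit(known[["A", "B"]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    scaler.transform(known[["A", "B"]]))

# Map cluster id -> majority quality label, then impute the NaN rows
mapping = known.groupby(km.labels_)["Quality"].agg(lambda s: s.mode()[0])
missing = df["Quality"].isna()
df.loc[missing, "Quality"] = mapping[
    km.predict(scaler.transform(df.loc[missing, ["A", "B"]]))].values
print(df.loc[[5, 40], "Quality"].tolist())
```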

====================================================================================================================

END PART TWO

====================================================================================================================

====================================================================================================================

PART THREE

====================================================================================================================

DOMAIN: Automobile

CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

• All the features are numeric i.e. geometric features extracted from the silhouette.


PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.

Steps and tasks:

  1. Data: Import, clean and pre-process the data
  2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

    For example: Use your best analytical approach to build this report. You can even mix and match columns to create new ones that support better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.

  3. Classifier: Design and train a best fit SVM classifier using all the data attributes.
  4. Dimensional reduction: perform dimensional reduction on the data.
  5. Classifier: Design and train a best fit SVM classifier using dimensionally reduced attributes.
  6. Conclusion: Showcase key pointer on how dimensional reduction helped in this case.

1. Data: Import, clean and pre-process the data

As we can see, out of 846 rows, 41 contain null values; hence we will remove the rows with null values.

Now we have 813 rows and 19 columns

Let's check whether there are any other placeholder values such as ?, - or other characters.

2. EDA and visualisation

Observations: there are a large number of feature pairs with high correlations.

3. Classifier: Design and train a best fit SVM classifier using all the data attributes.

• Segregate predictors vs target attributes

1. x: features
2. y: target variables 

4. Dimensional reduction: perform dimensional reduction on the data.

5. Classifier: Design and train a best fit SVM classifier using dimensionally reduced attributes.

6. Conclusion: Showcase key pointer on how dimensional reduction helped in this case.

model                    Train accuracy   Test accuracy   dimensions
SVM                      98.39            97.85           18
SVM with PCA transform   95.00            93.55           7

We can see that even after reducing the dimensionality from 18 to 7, we get 93.55% accuracy on the test data and 95% accuracy on the training data.

This is a minor change in accuracy, while the number of dimensions is reduced by 11.
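The comparison above can be sketched end to end as below. The data here is synthetic (via `make_classification`) standing in for the 18-feature silhouette data, so the scores are illustrative, not the report's:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 3 classes, 18 features
X, y = make_classification(n_samples=800, n_features=18, n_informative=7,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline SVM on all 18 attributes
svm_full = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

# SVM on the principal components explaining ~95% of the variance
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                        SVC()).fit(X_tr, y_tr)

print("full:", round(svm_full.score(X_te, y_te), 3))
print("pca :", round(svm_pca.score(X_te, y_te), 3))
```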

====================================================================================================================

END PART THREE

====================================================================================================================

====================================================================================================================

PART FOUR

====================================================================================================================

DOMAIN: Sports management

CONTEXT: Company X is a sports management company for international cricket.

DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:

  1. Runs: runs scored by the batsman
  2. Ave: average runs scored by the batsman per match
  3. SR: strike rate of the batsman
  4. Fours: number of boundaries (fours) scored
  5. Six: number of boundaries (sixes) scored
  6. HF: number of half-centuries scored so far

PROJECT OBJECTIVE: Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.

Steps and tasks:

  1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.
  2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.

SOLUTIONING :

As we can see, all the rows with null values have every column as NaN; hence we can remove those rows, as they do not contain any data.

1. EDA and visualisation

Runs : The runs distribution is slightly skewed toward the right

AVG : The AVG distribution has a few outliers

SR : The SR distribution has a few outliers and is skewed toward the left

Fours : The fours distribution is slightly skewed toward the right

Sixes : The sixes distribution is slightly skewed toward the right, with a few outliers

HF : The HF distribution is slightly skewed toward the right, with a few outliers

VERY HIGHLY CORRELATED DATA (|coef| >= 0.9):

[VERY HIGH CORRELATION] Runs and Fours : 0.9188

HIGHLY CORRELATED DATA (0.8 <= |coef| < 0.9):

[HIGH CORRELATION] Runs and HF have a high correlation of 0.8351

Since we can see some correlation between features, we can use PCA to reduce the dimensionality further.

2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.

Methods To be Used :

  1. Grouping using K-Means
  2. Ranking the players using PCA

Grouping using K-Means

Ranking the players using PCA

Let's start with 6 dimensions.

Dimensionality Reduction

Now 4 dimensions seem very reasonable: with 4 components we can explain over 95% of the variation in the original data.
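The PCA-based ranking can be sketched as below. The batting table is synthetic (one latent "skill" drives every stat), standing in for the real Runs/Ave/SR/Fours/Six/HF columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
skill = rng.normal(0, 1, 40)  # latent batting skill
df = pd.DataFrame({
    "Runs":  300 + 150 * skill + rng.normal(0, 30, 40),
    "Ave":   30 + 10 * skill + rng.normal(0, 3, 40),
    "SR":    120 + 15 * skill + rng.normal(0, 5, 40),
    "Fours": 30 + 12 * skill + rng.normal(0, 4, 40),
    "Six":   15 + 8 * skill + rng.normal(0, 3, 40),
    "HF":    3 + 2 * skill + rng.normal(0, 1, 40),
})

X = StandardScaler().fit_transform(df)
pca = PCA(n_components=4).fit(X)  # 4 components kept, as in the report

# PCA signs are arbitrary: flip each component so its Runs loading is >= 0,
# then weight the component scores by explained variance for one rank score.
flip = np.where(pca.components_[:, 0] < 0, -1, 1)
scores = (X @ pca.components_.T) * flip
df["score"] = scores @ pca.explained_variance_ratio_

print(df.sort_values("score", ascending=False).head(3).round(1))
```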

CONCLUSION

As above, we have sorted the players based on a performance score computed from the 4 dimensions retained after PCA.

With 4 components we can explain over 95% of the variation in the original data.

After sorting the players, we can see that the data is fairly good at reflecting the performance of a player.

For example:

CH Gayle has very high runs, average, SR, fours, sixes and HF.

S Dhawan's record is next to CH Gayle's, which justifies the ranking.

====================================================================================================================

END PART FOUR

====================================================================================================================

====================================================================================================================

PART FIVE

====================================================================================================================

• Questions:

1. List down all possible dimensionality reduction techniques that can be implemented using python.

2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

ANSWERS

Q1. List down all possible dimensionality reduction techniques that can be implemented using python.

ANS Q1:

Find combinations of new features:
    Linear methods :  
        Principal Component Analysis (PCA)
        Factor Analysis (FA)
        Linear Discriminant Analysis (LDA)
        Truncated Singular Value Decomposition (SVD)

    Non-linear methods (Manifold learning) :
        Kernel PCA
        t-distributed Stochastic Neighbor Embedding (t-SNE)
        Multidimensional Scaling (MDS)
        Isometric mapping (Isomap)
        Generalized discriminant analysis (GDA)

Only keep the important features:
        Backward Elimination
        Forward Selection
        Random forests



Q2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

ANS Q2:

  **STEPS:**
    1. Import images into python using PIL or any other python image library
    2. Display the image - actual
    3. Display the image - matrix (hint: an image is an MxN matrix of numbers)
    4. SL algorithms require an (instances x features) dataframe to classify, whereas here a single image is MxN, making it difficult for an SL algorithm to take the data in directly.
    Solution: flatten each image, i.e. MxN ---> 1x(M*N)
    5. Apply an SL algorithm like KNN or SVM or any other algorithm of your choice. Note the accuracy (A1)
    6. Use the image matrix to perform PCA on it.
    7. Apply the same SL algorithm as used above on the dimensionally reduced data. Note the accuracy (A2)
    8. Compare A1 ~ A2. If they are similar, then dimensional reduction has worked on images.
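The steps above can be sketched as below. scikit-learn's built-in 8x8 digits are used as a stand-in image set (the original used PIL-loaded images), and KNN is the chosen SL algorithm:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
# Flatten each MxN image into a 1x(M*N) row
X = digits.images.reshape(len(digits.images), -1)
y = digits.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A1: KNN on the raw flattened pixels
a1 = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# A2: the same classifier after PCA compression (95% variance kept)
pca = PCA(n_components=0.95).fit(X_tr)
a2 = KNeighborsClassifier().fit(pca.transform(X_tr), y_tr).score(
    pca.transform(X_te), y_te)

print(f"A1 (raw) = {a1:.3f}, A2 (PCA, {pca.n_components_} dims) = {a2:.3f}")
```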

CONCLUSION :

After compressing the images with PCA and comparing the accuracies before and after compression, we can see that the accuracy remains similar, and on the higher side.

A1 and A2 are similar, so dimensional reduction has worked on the images.

With the given illustration we have demonstrated the use of dimensional reduction on multimedia data such as images.

====================================================================================================================

END PART FIVE

====================================================================================================================

====================================================================================================================

END

====================================================================================================================